systematically reviewed how CT has been assessed in the literature. We reviewed 96 journal articles 
to analyze specific CT assessments from four perspectives: educational context, assessment 
construct, assessment type, and reliability and validity evidence. Our review results indicate that 
(a) more CT assessments are needed for high school, college students, and teacher professional 
development programs, (b) most CT assessments focus on students' programming or computing 
skills, (c) traditional tests and performance assessments are often used to assess CT skills, and 
surveys are used to measure students' CT dispositions, and (d) more reliability and validity evidence 
needs to be collected and reported in future studies. This review identifies current research 
gaps and future directions to conceptualize and assess CT skills, and the findings are expected to 
be beneficial for researchers, curriculum designers, and instructors. 


1. Introduction 


Computational Thinking (CT) has drawn increasing attention in the field of science, technology, engineering, and mathematics 
(STEM) education since Wing (2006) promoted it. Cuny, Snyder, and Wing (2010) define CT as “the thought processes involved in 
formulating problems and their solutions so that the solutions are represented in a form that can be effectively carried out by an 
information-processing agent” (Wing, 2011, p. 1). Not only is CT grounded in concepts fundamental to computer science (CS), a 
field that has dramatically impacted society (Wing, 2006; 2008), but it is integral to modern research and problem-solving work of 
STEM (Henderson, Cortina, & Wing, 2007). Thus, CT should be embedded in the educational system as a substantial learning goal to 
prepare students with competency in their future life (Grover & Pea, 2013). The International Society for Technology in Education and the 
Computer Science Teachers Association (CSTA & ISTE, 2011) developed resources on approaches to bringing CT into K-12 settings. 
Meanwhile, the National Research Council (NRC) organized two workshops with scholars in CS and education around the scope and 
nature of CT and the pedagogical aspects of CT (National Research Council, 2010; 2011; 2012). These efforts together with others in 
the education community promoted a milestone in STEM education in 2013: the Next Generation Science Standards (NGSS Lead 
States, 2013) listed “using mathematics and computational thinking” as one of the Science and Engineering Practices that integrate 
both disciplinary core ideas and crosscutting concepts. 

Drawing upon the growing interest in integrating CT in STEM education, the field has dedicated many efforts to promoting and examining 
students’ CT skills. These efforts include, but are not limited to, the following: (a) developing CT-integrated curricula (e.g., Rich, 
Spaepen, Strickland, & Moran, 2019; Sung, 2019); (b) inventing CT-inspired teaching and learning tools (e.g., Bers, 2010; Grover, 
2017a, 2017b; Weintrop et al., 2014); (c) building CT-embedded learning environments (e.g., Munoz-Repiso & Caballero-Gonzalez, 
2019); and (d) developing assessments focusing on students’ CT skills (e.g., Gonzalez, 2015; Korkmaz, Cakir, & Ozden, 2017a, 2017b). 
These studies generated a body of literature that helps us understand the nature of CT, the integration of CT in STEM classrooms, and the 
features of students’ performance in CT practices. 

Some researchers have synthesized the work related to CT. Lockwood and Mooney (2018) summarized CT research in secondary 
education and provided information on the subjects used to teach CT, the tools used to teach and assess CT, and benefits and barriers of 
incorporating CT in secondary education. Hsu, Chang, and Hung (2018) explored teaching and learning activities and strategies when 
promoting CT. However, systematic reflection on the tools used in these studies to evaluate student CT skills and performance is 
lacking, which leaves future research directions on the promotion and assessment of CT practices unclear. In fact, both Lockwood and 
Mooney (2018) and Hsu et al. (2018) stressed in their reviews the pressing need to address the education community’s uncertainty 
about how best to assess CT skills. 

Our study aims to systematically review CT assessment research in more detail than previous reviews regarding CT implementation 
contexts, CT and CT-related constructs, CT assessment tools, and their reliability and validity evidence across all educational levels. In 
particular, our review focuses specifically on the CT studies that apply assessments for the educational levels from kindergarten to 
college education and professional development for teachers. In addition, we reviewed these studies in terms of the educational 
contexts and subject domains in which they were conducted, the types of CT constructs measured (we classified the constructs 
in terms of CT and CT-related learning outcomes), the assessment tools employed to measure these types of CT constructs, 
and finally the reliability and validity evidence reported. In sum, Lockwood and Mooney (2018) provided a big picture 
of current CT studies in secondary education, while our review contributes more detailed information regarding the development and 
implementation of CT assessments. 


1.1. Definition and significance of CT 


In 1980, Seymour Papert first used the term “computational thinking” and suggested that computers might enhance thinking and 
change patterns of knowledge accessibility (Papert, 1980). He further emphasized that all children should have access to computers as 
a way to shape their learning and express their ideas (Papert, 1996). Later, in her influential article, Wing (2006) echoed these ideas by 
illuminating the concept of CT and broadcasting its applications in problem solving. She stated that CT should be at the core of K-12 
curricula and called for research on effective ways of teaching CT to students. Since then, CT has drawn increasing attention from 
educators and educational researchers and has been identified as a critical competence that would equip students with foundational 
skills to learn STEM (Weintrop et al., 2014). 

Despite a research history of around 15 years, the field has not reached a consensus on the definition of CT (National Research 
Council, 2011). As such, we constructed a diagram to illustrate some of the well-cited definitions (see Fig. 1). As shown in the figure, 
many researchers defined CT by drawing on programming and computing concepts. For example, Brennan and Resnick 
(2012) developed a theoretical framework of CT that involves three key dimensions. The first is computational 
concepts, including the programming terms of sequences, loops, parallelism, events, conditionals, operators, and data. The other two 
dimensions are computational practices, including the processes of iteration, debugging, and abstraction, and computational 
perspectives, including expressing, connecting, and questioning. Another defining framework that originates from computing concepts 
was proposed by Weintrop et al. (2016). They classified CT into four major categories with 22 sub-skills: data practices, modeling & 
simulation practices, computational problem-solving practices, and systems thinking practices. Based on this framework, they 
developed a series of CT-enhanced lesson plans for high school STEM classrooms. Denner, Werner, and Ortiz (2012) defined CT as a 
united competence, which is composed of three key dimensions of CT: programming, documenting and understanding software, and 
designing for usability. 

Different from the skills in working with computing or programming activities, some researchers regarded CT as a set of 
competences requiring students to develop both domain-specific knowledge and problem-solving skills. For example, CSTA and ISTE (2011) 
provided a list of vocabulary for CT: algorithms and procedures, automation, simulation, and parallelization. 
They suggested that those skills can be used to solve problems in everyday life, different 
subject domains, and across different grade levels. Selby and Woollard (2013) proposed operational definitions of CT skills including 
abstraction, decomposition, algorithmic thinking, evaluation, and generalization. As this operational definition is based on a 
meta-analysis of various studies of CT, it is broadly adopted in many studies (e.g., Atmatzidou & Demetriadis, 2016; Leonard et al., 
2018). During an experiment that assessed the impact of CT modules on preservice teachers, Yadav, Mayfield, Zhou, Hambrusch, and 
Korb (2014) explained five CT concepts: problem identification and decomposition, abstraction, logical thinking, algorithms, and 
debugging with concrete examples from day-to-day life and related these concepts to preservice teachers’ personal experiences. 

Despite the controversies surrounding the definition of CT, researchers have agreed that involving CT has an immense potential to 
transform how we approach subject domains in classrooms (e.g., Barr, Harrison, & Conery, 2011; Barr & Stephenson, 2011; Jona et al., 
2014; Lee et al., 2011; Repenning, Webb, & Ioannidou, 2010; Wing, 2006). Weintrop et al. (2016) concluded with three main benefits 
of embedding CT into STEM classrooms: building a reciprocal connection between math, science, and CT; constructing a more 
accessible classroom context for teachers and students; and keeping math and science classrooms up to date with current professional 
practices. 

In addition to the theoretical discussion, many studies have attempted to integrate CT in classrooms. For example, Malyn-Smith and 
Lee (2012) have facilitated the exploration of CT as a foundational skill for STEM professionals and how professionals engaged CT in 
routine work and problem solving. Later, Lee, Martin, and Apone (2014) integrated CT into K-8 classrooms through three types of 
fine-grained computational activities: digital storytelling, data collection and analysis, and computational science investigations. 
Further, researchers have developed CT interventions on various subject domains such as biology and physics (Sengupta, Kinnebrew, 
Basu, Biswas, & Clark, 2013), journalism and expository writing (Wolz, Stone, Pearson, Pulimood, & Switzer, 2011; Wolz, Stone, 
Pulimood, & Pearson, 2010), science in general (Basu, Biswas, & Kinnebrew, 2017a, 2017b; Basu et al., 2016; Weintrop et al. 2016), 
sciences and arts (Sáez-López, Román-González, & Vázquez-Cano, 2016a, 2016b), and mathematics (Wilkerson-Jerde, 2014). 

Taken together, researchers have illustrated the importance of teaching and learning CT and its integration in other subject 
domains. However, as indicated by Kalelioglu, Gülbahar, and Kukul (2016) in their review of CT studies, the current CT concept and 
definition lack scientific justification. Not surprisingly, researchers hold different perspectives when applying, interpreting, 
and assessing the proposed CT concept and definition. 


1.2. CT assessment 


Assessment plays a critical role when educators introduce CT into K-12 classrooms (Grover & Pea, 2013). Kalelioglu et al. (2016) 
also advocated more discussion of how to assess students' mastery and application of CT skills in real-life situations. In this 
study, we categorized CT assessments according to McMillan's (2013) paradigms of classroom assessment. Some CT studies employed 
selected-response and/or constructed-response tests. For example, Shell and Soh (2013a, 2013b) developed a paper-pencil test to assess college 
students’ CS knowledge and CT skills; also, Chen et al. (2017a, 2017b) designed an instrument with 15 multiple-choice questions and 
eight open-ended questions to assess students’ application of CT skills to solve daily life problems. Performance or portfolio assessment 
is another major assessment tool. Researchers created programming or CT activities for students to complete and then employed a 
grading rubric to evaluate their work products, e.g., the Fairy Assessment of Alice programming, an environment which allows users to 
build 3D virtual worlds (Werner, Denner, Campe, & Kawamoto, 2012); the analysis of digital portfolios designed by students to 
complete e-textiles projects using CT (Fields, Shaw, & Kafai, 2018; Lui et al., 2019), and the evaluation of students’ Scratch projects 
based on a visual programming environment (Garneli & Chorianopoulos, 2018). Questionnaires and interviews have also been used. 
For instance, Sáez-López et al. (2016a, 2016b) used a questionnaire to examine primary school students’ perceptions of computational 
concepts after learning a visual programming language on the Scratch platform. 

Researchers have examined the quality of CT assessment. For example, to examine the reliability and validity of a self-efficacy 
perception scale for CT skills, Gülbahar, Kert, and Kalelioglu (2018) conducted exploratory and confirmatory factor analyses. 
Weintrop et al. (2016) conducted interviews to probe students’ strategies for designing video games using a block-based programming 
language. 

A systematic review of specific CT assessments may yield insights into ways of improving and designing effective assessment tools 
so that two goals can be achieved: (a) the scholarly community and school practitioners can stay up to date on the availability of CT 
assessments and their characteristics, and (b) researchers can continually add to this scholarship by investigating unexamined but 
important topics surrounding CT assessment, the most relevant constructs in CT assessment, appropriate assessment formats to 
measure CT, and the necessary reliability and validity evidence. 

In this study, we conducted a systematic review with the purpose of reflecting on prior studies and identifying gaps by specifically 
focusing on one aspect of the literature: the assessment of CT. We analyzed the current state and features of studies exploring CT 
assessments and suggested future directions regarding how to assess CT for various purposes. The following research questions (RQ) 
formed the basis of this review: 


RQ1: What are the educational contexts to which CT assessments have been applied? 
RQ2: What CT constructs are measured by the CT assessments? 

RQ3: What assessment tools are used to assess CT constructs? 

RQ4: What is the reliability and validity evidence of these CT assessments? 


2. Method 
2.1. Literature search 


We searched three widely used and comprehensive digital databases to ensure the search covered all the relevant literature and 
journals: Education Resources Information Center (ERIC) (http://www.eric.ed.gov/), PsycINFO (http://www.apa.org/psycinfo/), and 
Google Scholar (http://www.scholar.google.com/). We first collected all journal articles containing the exact key phrase “computational 
thinking” alone in any field of each article. By doing so, we were able to include articles that contributed to assessment but did not 
mention assessment as a key term in their description. Although CT is related to computing, programming, and coding, we did not 
search for these words as alternative keywords because we considered CT as a distinct concept from computing/programming/coding 
and we only focused on studies in which researchers acknowledged CT and used the term “computational thinking.” 

With regard to the reference type, we decided to exclude conference papers in this review for the following reasons. (a) Journal 
articles have reported most of the significant scientific results, according to Bradford’s Law (Testa, 2009). In this case, conference 
papers would document results similar to those in journal articles. (b) Most conference papers on CT studies lack concrete 
information regarding their assessment, which may impede readers from making fair comparisons between assessments. This argument is 
also bolstered by the literature review of Zhang and Nouri (2019), in which they stated that the lack of vital information in some CT 
studies can hinder the replication of studies. The major drawback of excluding conference papers, though, is that we may have 
missed the most recent developments. To compensate for this problem, we conducted an updated literature search after the initial 
literature search, using the same procedure as before. 

In particular, we used the following inclusion criteria to select articles that (a) use “computational thinking” in any part of the 
paper (such as title, abstract, keywords, or main text); (b) were published before August 2019; (c) are peer-reviewed journal articles; (d) are 
available in full text; (e) are empirical studies containing assessment outcomes in terms of CT skills or CT-related skills; and (f) are written in 
English. After the initial search, we utilized a snowball method using the references of the selected articles, so that we could track 
articles which we may have missed in the earlier stage. This initial literature search resulted in 361 journal articles. The first author 
then downloaded all the initially selected articles. 
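
The inclusion criteria (a) to (f) above can be read as a conjunction of simple checks. The following is a minimal sketch of that reading, not the authors' actual screening procedure (which was done by hand): the `Article` record, its field names, and the three sample entries are hypothetical, and the cutoff date of 1 August 2019 is an assumed interpretation of "published before August 2019".

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    """Hypothetical screening record; field names are illustrative only."""
    title: str
    mentions_ct: bool               # (a) uses "computational thinking" anywhere
    published: date                 # (b) publication date
    peer_reviewed_journal: bool     # (c) peer-reviewed journal article
    full_text_available: bool       # (d) available in full text
    empirical_ct_assessment: bool   # (e) empirical study with CT assessment outcomes
    language: str                   # (f) language of the article


# Assumed reading of "published before August 2019".
CUTOFF = date(2019, 8, 1)


def meets_inclusion_criteria(article: Article) -> bool:
    """Return True only if all six inclusion criteria hold."""
    return (
        article.mentions_ct
        and article.published < CUTOFF
        and article.peer_reviewed_journal
        and article.full_text_available
        and article.empirical_ct_assessment
        and article.language == "English"
    )


# Hypothetical candidates, invented for illustration.
candidates = [
    Article("Study A", True, date(2018, 5, 1), True, True, True, "English"),
    Article("Study B", True, date(2019, 9, 15), True, True, True, "English"),  # too recent
    Article("Study C", True, date(2017, 3, 1), False, True, True, "English"),  # not a journal article
]

included = [a for a in candidates if meets_inclusion_criteria(a)]
print([a.title for a in included])
```

Writing the criteria as one predicate makes each exclusion traceable to a specific failed check, which is the kind of detail that supports replication.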

After the collection of the articles, the first and third author reviewed titles, abstracts, and methods sections together in order to rule 
out articles based on a set of exclusion criteria: (a) “Computational thinking” is not examined through empirical research. (b) The 
article is a theoretical work or a content analysis of current CT materials. (c) No information on assessment is provided. We developed 
the list of inclusion and exclusion criteria by modifying established criteria used in earlier reviews (e.g., Haseski, Ilic, & Tugtekin, 
2018; Lockwood & Mooney, 2018; Shute, Sun, & Asbell-Clarke, 2017; Zhang & Nouri, 2019). Disagreements between the two authors 
were resolved through discussion and further review of the disputed studies. Based on the reduction of articles following the inclusion 
and exclusion criteria by a group review, the number of articles was narrowed down to 96 in the final review. Among them, the initial 
search produced 77 relevant articles published before December 31, 2018. A follow-up search for articles published before August 
2019 resulted in another 19 journal articles. 


X. Tang et al. Computers & Education 148 (2020) 103798 



2.2. Analysis 


We coded the literature by systematically classifying the texts into categories in three stages according to the procedures of a 
content analysis (Fraenkel, Wallen, & Hyun, 2015). First, we developed a coding scheme to systematically extract information from each 
selected article in accordance with the four research questions. An Excel spreadsheet was used to store and analyze all data. Second, 
using the initial coding scheme, the first three authors reviewed and coded the same 11 articles that were randomly chosen. During the 
review, we improved the initial coding scheme by modifying and clarifying categories. Third, after the coding scheme became stable, 
the first and third author coded eight articles independently and reached an inter-rater agreement of 91.8% across all the categories. 
The discrepancies were resolved through discussion with the second author. Finally, given that an acceptable inter-rater agreement had been 
achieved, the first author independently coded the rest of the articles. The basic coding results are shown in Appendix A and the 
review process is illustrated in Fig. 2. Based on the coding results, frequencies and proportions for each category were computed 
and reported in the tables and figure. Then, a detailed explanation of the patterns that emerged from the reviewed studies in relation to the 
four research questions is given in the Results section, coupled with examples for each category. 
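
The text reports an inter-rater agreement of 91.8% but does not state how it was computed. A minimal sketch, assuming simple percent agreement (the share of coding decisions on which two coders assigned the same category), is given below. The function name and the sample codes are hypothetical and do not reproduce the study's data. Chance-corrected statistics such as Cohen's kappa are a common alternative.

```python
def percent_agreement(coder_a, coder_b):
    """Share of coding decisions (in percent) on which two coders agree.

    Assumes both sequences list the same decisions in the same order.
    """
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must rate the same set of decisions.")
    if not coder_a:
        raise ValueError("At least one coding decision is required.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)


# Hypothetical codes for ten decisions (e.g., the assessment tool of a study).
first_author = ["survey", "portfolio", "traditional", "interview", "survey",
                "portfolio", "portfolio", "traditional", "survey", "interview"]
third_author = ["survey", "portfolio", "traditional", "interview", "survey",
                "portfolio", "traditional", "traditional", "survey", "interview"]

print(percent_agreement(first_author, third_author))
```

In this invented example the coders differ on one of ten decisions, so the function returns 90.0.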


3. Results 
3.1. Educational contexts of the CT assessments 


Educational levels: Researchers conducted studies and implemented CT interventions across various educational levels. As shown in 
Table 1, elementary and middle schools were the most researched educational levels, each accounting for more than a quarter of the reviewed 
studies (28% and 27%, respectively), followed by high schools (15%) and colleges (15%). The rest integrated CT into teacher education programs or 
other professional practices (13%). Although some preservice teacher education programs were in college, we classified them into the 
category of teacher education to reveal the current state of professional development on CT applications. In general, considerable 
research on CT assessment has been devoted to K-8 schools, much more than high schools, colleges, teacher education, or other 
professional practices. 

Subject matter domains: We examined the subject domains in which the reviewed CT assessments were applied. As shown in Table 1, 
programming and CS were the subject matter most often researched (43%), which was consistent with the findings of Lockwood and 
Mooney (2018), followed by topics on robotics and game design (25%). A few studies (21%) implemented CT into non-CS STEM 
curriculum or activities. Non-STEM subjects, including poetry, journalism, etc., were the least researched (9%). Generally, most 
researchers restricted the development of CT skills to CS-related subjects, while other researchers extended CT skills to 
non-CS subjects. Although CT by its nature originated from CS in an effort to encourage people to think like a computer scientist, 
several researchers argued that the importance of CT lay in its ability to deepen understanding of other subjects (e.g., Weintrop et al., 
2016; Wing, 2008). However, the results revealed that the implementation of CT in the non-CS classrooms still lagged behind. 


[Flowchart] Searched articles from ERIC, PsycINFO, and Google Scholar with the key term “computational thinking”; developed the 
inclusion and exclusion criteria; collected 361 articles based on the inclusion criteria; downloaded the selected articles and reviewed 
titles, abstracts, and method sections; identified 96 articles for the final review based on the exclusion criteria; developed a coding 
scheme and set up an Excel spreadsheet; the first three authors coded 10 randomly selected articles together and modified the coding 
scheme; the first and third author coded 8 articles independently and reached an inter-rater agreement of 91.8%; the first author 
independently coded the rest of the articles. 

Fig. 2. The literature review process of this study. 




Table 1 
Educational contexts and assessment tools of the reviewed CT assessments. 

Variables            Categories                  Numbers   Percent
Educational level    K-elementary                31        28.18%
                     Middle                      30        27.27%
                     High                        17        15.45%
                     College                     17        15.45%
                     Teacher                     15        13.64%
Subject matter       Programming and CS          40        43.48%
                     Robotics and game design    23        25.00%
                     Non-CS STEM                 20        21.74%
                     Non-STEM                    9         9.78%
Educational setting  Formal                      60        67.42%
                     Informal                    27        30.34%
Assessment tool      Traditional                 34        24.11%
                     Portfolio                   47        33.33%
                     Interview                   22        15.60%
                     Survey                      38        26.95%


Note: Some studies used more than one category so the total number (i.e., the denominator) to calculate the percentage is the total number of 
each feature used in these studies. 
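
The note above implies that each percentage uses the sum of the category counts within a variable as its denominator, not the 96 reviewed articles. A minimal sketch of that arithmetic follows, using the counts for two variables from Table 1. The function name is an assumption introduced for illustration.

```python
def category_percentages(counts):
    """Percent of each category, using the sum of all category counts
    (not the number of reviewed articles) as the denominator."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Counts taken from Table 1.
educational_level = {
    "K-elementary": 31, "Middle": 30, "High": 17, "College": 17, "Teacher": 15,
}
assessment_tool = {
    "Traditional": 34, "Portfolio": 47, "Interview": 22, "Survey": 38,
}

print(sum(educational_level.values()))          # 110, larger than the 96 articles
print(category_percentages(educational_level))
print(category_percentages(assessment_tool))
```

The educational-level counts sum to 110 and the assessment-tool counts to 141, both above 96, which is consistent with some studies falling into more than one category.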


As the social movement of CT has called for the involvement of CT into other disciplines in K-12 classrooms (diSessa, 2018), we 
analyzed the relationship between educational levels and subject domains manifested in the reviewed studies. We found that the 
distribution of the four major subject domains (i.e., programming, robotics/game, STEM, non-STEM) for the elementary school level 
was similar to that for the middle school level (Fig. 3). However, consistent with the previous finding of less research for high school, 
college, and teacher education, there was no intervention or curriculum pertinent to non-STEM subjects for high school and college 
students and, surprisingly, no research on integrating CT with non-CS STEM subjects for teacher professional development in the 
reviewed articles. Although CT is regarded as being at the core of all STEM subject domains (Henderson et al., 2007) and the idea of infusing CT 
into K-12 classrooms has been proposed for a decade, few studies have focused on the professional training for K-12 teachers on how to 
integrate CT with STEM subjects in practice. 

Educational settings: As shown in Table 1, about one-third of the studies were conducted in an informal educational context, which 
included summer camps, after-school programs, or other out-of-school activities. The remaining studies were conducted in formal 
contexts (i.e., classrooms), accounting for about two thirds of the studies. The high proportion of studies conducted in formal settings appears to 
align with the move to incorporate CT into subject domain standards. 


3.2. Constructs of the CT assessments 


As discussed in the introduction, CT has been defined across the literature with both divergence and agreement. Hence, we were not 
surprised to find great diversity in the ways that CT constructs were defined and operationalized in assessment. We constructed a 
hierarchical structure to classify the common patterns that emerged among all the constructs being measured based on two 
classifications: (a) Sullivan and Heffernan’s (2016) categories of first-order and second-order learning outcomes regarding the relationship 
between CT and other subject domains and (b) McMillan’s (2013) categories of cognitive constructs and non-cognitive dispositions and 
skills. 

Fig. 3. Relationship between educational levels and subject domains of the reviewed CT assessments. 

The first level of this structure distinguishes the first-order use of constructs from the second-order use of constructs 
(Sullivan & Heffernan, 2016). The first-order usage of knowledge refers to a direct cognitive manifestation of the studied domain 
knowledge, which in this review means CT skills defined and assessed independently of other subject domains. The second-order 
application denotes a representation of relevant domain knowledge as a derivative of CT integration, such as programming skills, 
STEM, and non-STEM knowledge. Based on the first-level classification of first-order and second-order usage, we then divided each 
into the second-level classification of cognitive and non-cognitive constructs based on McMillan's (2013) identification of constructs. 
We further classified the categories of second-order cognitive and non-cognitive constructs according to the relevant subject domains 
as the third level of this structure. Finally, we ended up with seven sub-categories and computed the frequencies and proportions 
with which they were researched in the reviewed articles (shown in Table 2). We also listed some sample questions in Appendix B. 
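
The three-level structure described above can be sketched as a mapping from a path (order, construct type, subject domain) to a frequency, which makes it easy to aggregate at any level. This is an illustrative representation only, not the authors' coding tool. The frequencies are those reported in Table 2, while the variable and function names and the exact label wording are assumptions.

```python
from collections import defaultdict

# (order, construct type, subject domain) -> frequency reported in Table 2.
CONSTRUCTS = {
    ("first-order", "cognitive", "CT skills"): 61,
    ("first-order", "non-cognitive", "dispositions and attitudes toward CT"): 11,
    ("second-order", "cognitive", "programming/computing/CS"): 33,
    ("second-order", "cognitive", "STEM (excluding CS)"): 14,
    ("second-order", "cognitive", "non-STEM"): 1,
    ("second-order", "non-cognitive", "programming/computing/CS"): 21,
    ("second-order", "non-cognitive", "STEM/non-STEM"): 5,
}


def totals_by_level(constructs, depth):
    """Sum frequencies over the first `depth` levels of the hierarchy."""
    totals = defaultdict(int)
    for path, frequency in constructs.items():
        totals[path[:depth]] += frequency
    return dict(totals)


print(len(CONSTRUCTS))                 # seven sub-categories
print(totals_by_level(CONSTRUCTS, 1))  # first-order vs. second-order
print(totals_by_level(CONSTRUCTS, 2))  # split further by cognitive/non-cognitive
```

Aggregating at the first level gives 72 first-order and 74 second-order constructs out of 146 measured constructs in total.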

First-order cognitive CT: More than one-third of the reviewed studies directly assessed CT concepts and skills. These were derived 
from CS or programming concepts but were further developed into cognitive thinking skills that are applicable to problem-solving 
processes in all subject domains. Hence, teachers without knowledge of CS and students without any prior experience in programming are 
able to understand and apply these types of higher-order thinking skills in teaching and learning. The first-order use of CT has been 
supported by CSTA and ISTE (2011), who listed examples of applying each component across various subjects including math, science, 
history, etc. As an example of implementing the CSTA and ISTE’s (2011) CT framework and Selby and Woollard’s (2013) CT 
nitions, Bebras International Contest was initiated in Lithuania in 2004 which aimed to promote CT among K-12 students (Cartelli, 
Dagiene, & Futschek, 2012, pp. 35-46; Dagiene & Futschek, 2008). Different from programming-based CT tests, the Bebras tasks aimed 
to elicit students’ CT (Palts, Pedaste, Vene, & Vinikiené, 2017) to solve problems for potential real-life scenarios without tapping 
specific subject contents and using programming platforms or devices. From another perspective, ISTE (2015) summarized CT as a 
common reflection of creativity, algorithmic thinking, critical thinking, problem solving, cooperative thinking, and communication 
skills. Durak and Saritepeci (2018) designed a CT questionnaire with these five factors as subscales and found that CT skills were 
positively correlated with general thinking skills. 

First-order non-cognitive: About seven percent of studies assessed participants’ dispositions and attitudes toward CT. For example, 
Mesiti, Parkes, Paneto, and Cahill (2019) designed a survey to elicit middle and high school students’ self-efficacy and interests related 
to problem decomposition after completing CT tasks mediated by animated films. Considering professional development as an 
important role in teachers’ recognition and application of CT in their classrooms, Leonard et al. (2017) employed an existing teachers’ 
CT belief survey to probe their comfort, interest, and classroom practice of CT skills. However, few studies have focused on the 
dispositions that would strengthen CT skills and their components, such as the dispositions of persistence and tolerance for ambiguity as 
suggested by CSTA and ISTE (2011). 

Second-order cognitive programming: Programming- and computing-related concepts were the constructs measured in about one fourth of 
the reviewed studies. Many researchers defined CT in terms of programming or computing concepts, so they designed 
interventions based on programming activities and assessed students’ programming skills as evidence of students’ CT skills. For example, 
Scratch, a block-based programming environment, was developed at MIT for users to design interactive media. Later, an assessment 
tool of Scratch, Dr. Scratch, was developed and utilized by many researchers to evaluate users’ Scratch projects and assess their 
programming and CT skills. Echoing the core ideas of Scratch, Brennan and Resnick (2012) proposed the 3D CT framework in which 
the cognitive aspect of CT was computational concepts. The application of this framework and the Scratch platform orchestrated a 
collection of studies that defined CT concepts congruently with programming concepts (e.g., Falloon, 2016; Garneli & Chorianopoulos, 
2018; Grover, Pea, & Cooper, 2015; Jenkins, 2015; Lye & Koh, 2014; Pugnali, Sullivan, & Bers, 2017; Roman-Gonzalez, 
Pérez-Gonzalez, Moreno-León, & Robles, 2017a, 2017b; Zhong, Wang, Chen, & Li, 2016a, 2016b). For example, Pugnali et al. (2017) 
defined CT using programming concepts, including sequencing, loops, conditionals, and debugging, and analyzed students' projects 
using a graphical coding application on the iPad (Scratch Jr.) and a tangible programmable robotics kit (KIBO). It seems natural to define 
CT by referring to programming concepts because the two are related and share similarities. However, relying heavily 
on programming terms to define the constructs of CT assessment reveals a lack of connection or applicability to non-CS 
subjects. 

Second-order cognitive STEM: Nine percent of the studies embedded CT into non-CS STEM interventions and measured non-CS STEM 
knowledge as the constructs. For example, by using a CT-based learning environment for K-12 science, Computational Thinking in 
Simulation and Modeling (CTSiM), Basu et al. (2016) argued that a lack of domain knowledge was one of the challenges students faced 
when developing their programming and simulation models. With a focus on physics knowledge, Jaipal-Jamani and Angeli (2017a, 
2017b) assessed pre-service teachers’ understanding of gears and relevant physics knowledge after participating in activities with 
LEGO WeDo robotics kits. To build a bridge between CT and math proficiency, Pei, Weintrop, and Wilensky (2018) examined how 
students’ mathematical habits of mind aligned with CT in terms of four overarching categories: data, modeling and simulation, 
computational problem-solving, and systems thinking. These studies showcased the CT infusion into STEM education and assessment. 


Table 2 
A classification structure of CT constructs measured in the reviewed studies. 

Sullivan & Hefferman (2018)  McMillan (2013)  Subject domains           Examples (sample questions in Appendix B)                                         Frequency  Percent 
First-order                  Cognitive        CT skills                 Bebras items (Dolgopolovas, Jevsikova, Dagiene, & Savulioniene, 2016)             61         41.78% 
First-order                  Non-Cognitive    CT propositions           CT perceptions (Leonard et al., 2018)                                             11         7.53% 
Second-order                 Cognitive        Programming/computing/CS  Scratch projects (Grover et al., 2015)                                            33         22.60% 
Second-order                 Cognitive        STEM (exclude CS)         Lattice Land (Pei et al., 2018)                                                   14         9.59% 
Second-order                 Cognitive        Non-STEM                  Poetic thinking and CT skills (Jenkins, 2015)                                     1          0.68% 
Second-order                 Non-Cognitive    Programming/computing/CS  Perceptions about programming (Adler and Kim, 2018)                               21         14.38% 
Second-order                 Non-Cognitive    STEM/non-STEM             Comfort, interest, and classroom practice of CT skills (Sáez-López et al., 2016)  5          3.42% 

Note: Some studies assessed more than one construct so the total number (i.e., the denominator) to calculate the percentage is the total number of each 
construct type used in these studies. 
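The percentages in Table 2 can be checked directly from the reported frequencies. The short Python sketch below copies the counts from Table 2 and divides each by the total number of construct uses, as the table note describes; the variable and function names are illustrative only.

```python
# Frequencies copied from Table 2.
# Keys: (order, construct type, subject domain).
table2_counts = {
    ("First-order", "Cognitive", "CT skills"): 61,
    ("First-order", "Non-cognitive", "CT propositions"): 11,
    ("Second-order", "Cognitive", "Programming/computing/CS"): 33,
    ("Second-order", "Cognitive", "STEM (exclude CS)"): 14,
    ("Second-order", "Cognitive", "Non-STEM"): 1,
    ("Second-order", "Non-cognitive", "Programming/computing/CS"): 21,
    ("Second-order", "Non-cognitive", "STEM/non-STEM"): 5,
}


def percentages(counts):
    """Return each count as a percentage of the total, rounded to 2 decimals."""
    total = sum(counts.values())
    return {key: round(100 * n / total, 2) for key, n in counts.items()}


table2_total = sum(table2_counts.values())
table2_percent = percentages(table2_counts)

for key, pct in table2_percent.items():
    print(" / ".join(key), f"{pct:.2f}%")
```

Because some studies assessed more than one construct, the denominator here is the number of construct uses rather than the number of reviewed studies.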

Second-order cognitive non-STEM: Only one CT assessment study integrated CT into non-STEM subjects. Jenkins (2015) examined 
the potential of using a block-based programming poem generator program to evaluate students’ development of CT skills and poetic 
thinking in English. The test of poetic thinking and CT skills measured the statutory literacy skills of ‘adapting structures in writing’, 
‘using a wide range of sentence structures’ and ‘using knowledge of word roots and families’ alongside the programming concepts of 
sequences, loops and events that were defined in Brennan and Resnick’s (2012) CT framework. 

Second-order non-cognitive: Around fifteen percent of the studies examined participants’ dispositions and attitudes toward 
CT-related concepts and activities. For instance, Adler and Kim (2018) created a web-based simulation to help teach Newton’s Second 
Law of Motion in a science methods course for preservice teachers. After working with the simulation, the preservice teachers were asked 
to complete a post-questionnaire about their experience with the simulation and Scratch programming. As an example of probing 
elementary students’ self-efficacy in non-STEM subjects after receiving a CT-integrated arts curriculum, Sáez-López et al. (2016a, 
2016b) used a survey to examine students' perceptions of art history. 

In summary, the above review suggests that many studies chose to measure programming or computing concepts as a represen- 
tation of CT skills or defined cognitive CT skills using programming and CS concepts. We consider CT and CS as related, but not the 
same. Although it is understandable that CT skills can be defined and operationalized differently, most of the reviewed articles have not 
explicitly explained why assessing CT skills can be interchangeable with assessing programming or computational concepts. 


3.3. Tools for assessing CT 


Despite the variability in assessment design for CT, four categories of assessment types emerged from the literature, including the 
traditional test composed of selected- or constructed-response questions, portfolio assessment, interviews, and surveys. Some studies 
employed more than one assessment type in order to collect multidimensional evidence of students' CT skills. The frequency of each 
type is reported in Table 1. 

Selected- or constructed-response tests: Some studies (24%) chose to develop an assessment composed of multiple-choice and/or 
open-ended questions, usually evaluated by correctness and completeness and designed for summative purposes. For instance, Jenson and 
Droumeva (2016) designed a test to evaluate students' existing knowledge of CS concepts such as variables, operations, and functions 
as evidence of student CT proficiency. College students' learning and retention of CT skills were assessed in the studies by Peteranetz, 
Flanigan, Shell, and Soh (2017) and by Flanigan, Peteranetz, Shell, and Soh (2017a, 2017b) via a web-based, 13-item test developed by CS 
and engineering faculty. The test addressed common core CS content including selection, looping, arrays, functions, algorithms, 
search, and sorting. 

The usage of this traditional assessment type suggests a trend: although CT is considered a cognitive process, many 
researchers regard it as a learning product (e.g., knowledge or skills gained) when assessing it. It further implies that CT skills can be 
regarded as quantifiable mastery of knowledge regarding CT components. Practically speaking, teachers and researchers can adapt a 
reliable and valid knowledge assessment to their classrooms or interventions and use it to 
measure students' CT knowledge for summative purposes. From the perspective of promoting an ongoing learning 
experience, however, Fields, Lui, and Kafai (2019) argued that traditional assessment fails to capture the process of CT learning when 
students work on hands-on projects. 

Portfolio assessment: Portfolio assessment, a type of performance assessment, refers to a purposeful, systematic process of collecting 
and evaluating various types of student products to examine the attainment of learning targets (McMillan, 2013). More than one-third 
of the reviewed studies used portfolio assessment to evaluate students’ CT skills through their projects, notes, or other direct 
observations. As in other performance assessments, researchers used a grading rubric to indicate different levels of achievement for each 
dimension of CT performance or a checklist to indicate whether a certain criterion was met (i.e., binary coding). For example, Bers, 
Flannery, Kazakoff, and Sullivan (2014) evaluated students’ knowledge of debugging, correspondence, sequencing, and control flow 
by assigning the appropriate level of achievement based on a scoring rubric to each student's robotics project. 

Another widely used portfolio assessment comes from the analysis of Scratch projects. Moreno-León, Robles, and Román-González 
(2017a, 2017b) utilized the rubric provided by the abovementioned Dr. Scratch to analyze students’ Scratch projects on seven 
concepts (abstraction and problem decomposition, parallelism, logical thinking, synchronization, algorithmic notions of flow control, user 
interactivity, and data representation) at three mastery levels for each dimension: basic, developing, and proficient. Because rubrics of 
this type are typically applied by human raters, they require a clear distinction between levels of performance so that raters can identify the 
rating that best represents a student's CT level. 
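To make the rubric structure concrete, the following Python sketch scores one project on the seven concepts and three mastery levels named above. It is a hypothetical illustration, not the Dr. Scratch implementation: the point values assigned to each level, the function name `score_project`, and the example ratings are assumptions introduced here.

```python
# The seven concepts named in the text for the Dr. Scratch rubric.
CONCEPTS = [
    "abstraction and problem decomposition",
    "parallelism",
    "logical thinking",
    "synchronization",
    "flow control",
    "user interactivity",
    "data representation",
]

# Assumed point values for the three mastery levels (hypothetical,
# chosen only so that levels can be summed into a total score).
LEVEL_POINTS = {"basic": 1, "developing": 2, "proficient": 3}


def score_project(ratings):
    """Sum the points for one project given a {concept: level} mapping."""
    missing = [c for c in CONCEPTS if c not in ratings]
    if missing:
        raise ValueError(f"unrated concepts: {missing}")
    return sum(LEVEL_POINTS[ratings[c]] for c in CONCEPTS)


# Hypothetical ratings for a single student project.
example_ratings = {concept: "developing" for concept in CONCEPTS}
example_ratings["parallelism"] = "proficient"
example_ratings["data representation"] = "basic"

print(score_project(example_ratings), "out of", 3 * len(CONCEPTS))
```

A checklist with binary coding, as mentioned earlier in this section, would follow the same pattern with only two levels per criterion.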

Another method of portfolio analysis counted the presence of each CT dimension. It is useful for researchers who are interested in 
tracking which CT components students used more frequently through verbal communication or analysis of students' projects. Denner 
et al. (2012) coded student-designed games and counted the presence of each coding category to identify how often they used the 
features in the Stagecast Creator program, a programming platform that allows users to choose visual characters and before-after rules 
instead of writing programming commands. 

To summarize, the application of portfolio assessment could capture a holistic view of what skills students have obtained through 
projects or work products. For studies involving hands-on collaborative projects, portfolio assessment would help to promote and 
evaluate students’ communication in CT literacy as indicated by Lui et al. (2019a, 2019b) and can also serve as a formative assessment 
tool providing students with feedback potentially beneficial to their future learning. Many researchers chose to use portfolio assess- 
ment when they employed programming platforms or instructional tools in their interventions to support students’ hands-on learning 
activities. From another perspective, Chen et al. (2017a, 2017b) pointed out some challenges of using performance assessment to 
measure CT: it may limit the use of a pre-post design and can only be implemented on a particular programming or computing 
platform. They called for an assessment convenient for teachers to use in classrooms and applicable across platforms so that students’ 
CT performance can be comparable across studies using different instructional tools. 

Surveys: Surveys are often used for investigating affective/non-cognitive learning outcomes, particularly for motivations and at- 
titudes toward CT learning. Twenty-six percent of studies developed surveys that used quantitative items (e.g., a Likert scale) and/or 
open-ended questions. A majority of surveys were designed to collect student self-report responses, while some were administered to 
teachers to collect their perceptions of CT during professional development interventions (Yadav et al., 2014). Although surveys were 
often used to probe participants’ non-cognitive outcomes, several studies used surveys to elicit students’ cognitive CT skills. For 
example, Bower, Wood, Lai, Howe, Lister, Mason, Highfield, et al. (2017) conducted qualitative thematic analysis of students’ 
open-ended responses to identify students’ usage of CT concepts and examine their awareness of CT. The frequent use of the survey 
approach reflects its advantages: convenient administration, efficient data collection from large samples, and quantifiable 
results. However, because surveys rely on self-report, they are limited in capturing students’ spontaneous perceptions or attitudes. In addition, 
surveys might not be the best way to understand young children’s interest in CT, as they might not understand the survey 
questions. To address these limitations, surveys could be coupled with other methods (e.g., interviews, focus groups) in order to 
investigate students’ emergent in-depth thinking processes. 

Interviews: In fifteen percent of the reviewed studies, researchers conducted interviews to probe participants’ understanding of CT skills 
and coded their behaviors or verbal communication using pre-developed protocols. This approach is particularly useful for 
exploring open questions about how participants developed and applied CT. In a study by Cetin (2016), pre-service teachers were 
interviewed and encouraged to describe their experience of Scratch-based classroom teaching; Cetin (2016) then analyzed the emergent 
patterns in the interview data. Atmatzidou and Demetriadis (2016) used a more diagnostic approach, the think-aloud strategy 
(Ericsson & Simon, 1998), by asking students to verbalize their thinking while solving robot programming tasks. They analyzed the CT concepts 
students used and interviewed students about their perceptions of CT concepts, understanding of basic programming concepts, and 
their views of the development of CT skills. Researchers usually employed interviews to support or elaborate on the results of traditional or 
portfolio assessment by specifying students’ thinking processes when using CT skills to solve problems or the difficulties they faced when 
working on CT-related hands-on projects. The challenges of using interviews, on the other hand, include the high cost and long time 
required for interviewing and coding the data, as well as the small number of students reached, which makes the results difficult to quantify. 

As noted above, about one third of the studies (see Table 3) used more than one assessment tool to measure 
various CT constructs in order to examine different aspects of student CT learning experiences. Specifically, around 20 percent of the studies 
employed traditional tests together with other assessment tools in order to gain a more comprehensive view of students’ CT 
performance. For example, Grover et al. (2015) employed a traditional knowledge test to assess students’ CT knowledge, used surveys to 
probe students’ perceptions of CT and CS, and interviewed low-performing students about their difficulties in applying CT to complete 
their final projects. However, consistent with the infrequent use of interviews on their own, the combined use of assessment tools to collect 
both quantitative and qualitative data (i.e., interview data) was rare. This lack of combined use highlights a need for more 
in-depth investigation of the cognitive process of developing CT skills. 

As most studies involved interventions, the pre-post assessment design was used to evaluate the intervention effects for 27 percent 
of the studies. Further, when exploring the proximity between instruction and assessment according to the theory of Ruiz-Primo, 
Shavelson, Hamilton, and Klein (2002), we found that most assessments were sensitive to their instructional modules and focused on 
intervention-related knowledge. Although the reviewed studies showed evidence for the improvement of CT by conducting 
corresponding instruction or intervention, it is unclear whether CT could help to improve students’ knowledge of other disciplines or 
thinking skills in general. Given that CT can be implemented in teaching different disciplines, it would be interesting to explore if 
students still had significant learning improvements using CT assessment tools free from particular programming platforms and 
specific interventions and materials. 


Table 3 
The frequency of each single use and combined use of CT assessment tools. 

Assessment tools                                             Frequency  Percent 
Single use    Traditional                                    15         15.63% 
Single use    Portfolio                                      24         25.00% 
Single use    Interview                                      8          8.33% 
Single use    Survey                                         18         18.75% 
Combined use  Traditional + Portfolio                        6          6.25% 
Combined use  Traditional + Survey                           4          4.17% 
Combined use  Traditional + Portfolio + Interview            1          1.04% 
Combined use  Traditional + Portfolio + Survey               4          4.17% 
Combined use  Traditional + Interview + Survey               3          3.13% 
Combined use  Traditional + Portfolio + Interview + Survey   1          1.04% 
Combined use  Portfolio + Interview                          5          5.21% 
Combined use  Portfolio + Survey                             4          4.17% 
Combined use  Portfolio + Survey + Interview                 2          2.08% 
Combined use  Survey + Interview                             1          1.04% 

Note: The denominator is the total number of the reviewed studies (N = 96). 
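The frequencies in Table 3 can also be aggregated to see how many studies combined tools and how often each tool was used overall. The Python sketch below copies the counts from Table 3; the per-tool shares it prints use the total number of tool uses as the denominator, which is an assumption made here for illustration and may differ from the denominator used in Table 1.

```python
from collections import Counter

# Frequencies copied from Table 3; each key lists the tools one study used.
table3_counts = {
    ("Traditional",): 15,
    ("Portfolio",): 24,
    ("Interview",): 8,
    ("Survey",): 18,
    ("Traditional", "Portfolio"): 6,
    ("Traditional", "Survey"): 4,
    ("Traditional", "Portfolio", "Interview"): 1,
    ("Traditional", "Portfolio", "Survey"): 4,
    ("Traditional", "Interview", "Survey"): 3,
    ("Traditional", "Portfolio", "Interview", "Survey"): 1,
    ("Portfolio", "Interview"): 5,
    ("Portfolio", "Survey"): 4,
    ("Portfolio", "Survey", "Interview"): 2,
    ("Survey", "Interview"): 1,
}

# Number of reviewed studies and number that combined two or more tools.
n_studies = sum(table3_counts.values())
n_combined = sum(n for tools, n in table3_counts.items() if len(tools) > 1)

# How many studies used each tool, alone or in combination.
tool_uses = Counter()
for tools, n in table3_counts.items():
    for tool in tools:
        tool_uses[tool] += n

total_uses = sum(tool_uses.values())
print(f"combined use: {n_combined}/{n_studies} = {100 * n_combined / n_studies:.1f}%")
for tool, n in tool_uses.most_common():
    print(f"{tool}: {n} uses, {100 * n / total_uses:.1f}% of all tool uses")
```

Counting a study once per tool it used explains why the per-tool totals sum to more than the 96 reviewed studies.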


3.4. Reliability and validity evidence of the CT assessments 


Among the 96 studies, fewer than half (45%) reported reliability evidence and 18 percent provided validity evidence. Some 
reliability evidence (i.e., inter-rater reliability or Cohen’s kappa) was provided when human coding was undertaken, especially for 
interviews, performance assessments, and open-ended responses. Some researchers who employed multiple choice or survey questions 
reported internal consistency of tests to determine if students received similar scores on items measuring the same CT constructs. 
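As an illustration of the inter-rater reliability evidence mentioned above, the following minimal Python sketch computes Cohen's kappa for two raters who assign categorical codes to the same set of student products. The function name `cohens_kappa` and the example ratings are hypothetical and are not taken from any reviewed study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal category frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty item set")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        # Both raters used one identical category for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical rubric levels assigned by two raters to eight projects.
rater_1 = ["basic", "developing", "proficient", "developing",
           "basic", "proficient", "developing", "basic"]
rater_2 = ["basic", "developing", "developing", "developing",
           "basic", "proficient", "proficient", "basic"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw percentage agreement for the agreement two raters would reach by chance, which is why it is commonly reported when human coding is involved.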

The studies reporting validity provided various types of validity evidence. For instance, Román-González, 
Pérez-González, and Jiménez-Fernández (2017a, 2017b) correlated students' CT scores with their scores on a problem-solving test, an 
assessment evaluating speed and flexibility in performing logical operations. In another case, Sáez-López et al. (2016a, 2016b) 
emphasized the importance of using multiple methods to conduct data triangulation and validation. Collaborating with nine subject 
matter experts, they evaluated the content validity of their survey questionnaire on perceptions of art history and CT and reported an 
acceptable level of Aiken V content validity index. They also examined the construct validity of the questionnaire by using exploratory 
factor analysis. As another example of examining content validity, Djambong and Freiman (2016) organized a panel of experts to 
evaluate whether their test contents corresponded to the intended constructs. They also suggested analyzing students' think-alouds to 
examine the validity evidence of Bebras items. Further, to evaluate the validity of Bebras items, Araujo, Andrade, Guerrero, and Melo 
(2019) conducted confirmatory factor analysis to determine if there is statistical evidence to support that Bebras tasks assessed all the 
claimed CT skills. Korkmaz et al. (2017a, 2017b) developed a five-point CT scale consisting of 29 items to examine students' CT skill 
levels. They studied the construct validity of this scale using exploratory and confirmatory factor analysis, and examined 
item distinctiveness through an independent-samples t-test among students of different performance levels. They further 
evaluated reliability based on split-half correlations of the scale, the Spearman-Brown reliability coefficient, the Guttman split-half value, 
and Cronbach's alpha coefficient. Test-retest reliability was also examined to evaluate measurement consistency. This study showcased a solid 
process of reliability and validity analyses of a self-developed CT scale. 
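Two of the internal-consistency indices named above, Cronbach's alpha and the Spearman-Brown corrected split-half coefficient, can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook formulas, not the procedure of any reviewed study; the function names, the odd-even item split, and the example responses are assumptions introduced here.

```python
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha; `scores` is a list of respondents' item-score lists."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_brown(half_correlation):
    """Step a half-test correlation up to full-test length."""
    return 2 * half_correlation / (1 + half_correlation)


def split_half_reliability(scores):
    """Odd-even split-half reliability with the Spearman-Brown correction."""
    odd_half = [sum(row[0::2]) for row in scores]
    even_half = [sum(row[1::2]) for row in scores]
    return spearman_brown(pearson(odd_half, even_half))


# Hypothetical responses: five students, four items on a five-point scale.
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

print(f"Cronbach's alpha: {cronbach_alpha(responses):.3f}")
print(f"Split-half (Spearman-Brown): {split_half_reliability(responses):.3f}")
```

Both indices ask whether items intended to measure the same construct produce consistent scores; with a real instrument, far more respondents than in this toy example would be needed for stable estimates.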

Although some reliability and validity information is provided for CT assessments, most of the CT assessments lacked reliability 
and validity evidence. Without this evidence, it is difficult for the field to use these assessments with confidence in classrooms to 
measure students' CT learning, especially in high-stakes tests. 


4. Discussion 


CT is a relatively fast-moving field that has been explored by researchers for the last decade. Through this systematic review, we 
mapped the current territory of CT assessment implemented across all educational levels, identified what has been explored, and 
recognized the research gaps. 

First, the majority of studies tended to focus on cultivating CT in the elementary and middle school grade levels. It is worth noting 
that although it is challenging to conduct CT assessments developmentally appropriate for younger children due to their limited 
reading and understanding skills (Zhang & Nouri, 2019), researchers have tried to implement CT at early stages of students' cognitive 
development. However, as no rationale suggests that elementary and middle school are the only critical stages for students to learn CT, 
it is necessary to further enrich the literature on examining CT assessments appropriate for high school (Flórez et al., 2017) and college 
students. By doing so, researchers and practitioners can find resources for a complete developmental trajectory of students' CT skills. 

Second, more CT assessments should be developed for CT interventions or activities applied in informal educational contexts. 
Grover and Pea (2013), Kjallander, Akerfeldt, Mannila, and Parnes (2018), and Martin (2017) suggested that informal educational 
contexts such as makerspaces and DIY workshops play critical roles in implementing CT education. Hadad, Kachovska, Thomas, and 
Yin (2019) observed various forms of formative assessments when facilitating scaffolded instruction of CT with high school students in 
a making context. According to our review, CT assessment was lacking in informal contexts, making it difficult for the research 
community to evaluate the impact of these informal-context-based interventions on improving CT. 

Third, CT assessment for teacher education and professional development workshops was insufficiently researched. Kang, 
Donovan, and McCarthy (2018) found that teachers had less knowledge of and confidence in teaching CT than in teaching other 
subject content, and English (2018) claimed that it was problematic to introduce CT in classrooms without teacher professional 
development. Thus, the field needs more solid research and practice to help teachers implement CT integration. Accordingly, more 
assessments to measure teachers’ CT proficiencies should be developed as well. 

Fourth, more assessments should be developed to emphasize the alignment between CT skills and domain knowledge, so that they 
can better serve the trending integration of CT into STEM and non-STEM subjects. CT has the potential to deepen students’ under- 
standing of various subject domains, including that of STEM, non-STEM, and everyday life problem solving (CSTA & ISTE, 2011; 
Weintrop et al., 2016; Wing, 2008). Although it is necessary to further verify how CT will help students build thinking skills in general, 
Wing (2011, pp. 20-23) emphasized that CT has begun to influence other disciplines as it can go beyond computer science. In addition, 
researchers (Jacob, Nguyen, Tofel-Grehl, Richardson, & Warschauer, 2018; Weintrop et al., 2016) stated that bringing CT into both 
STEM and non-STEM classrooms would prepare students for 21st-century digital citizenship. From the perspective of the learning 
sciences, Grover and Pea (2018, pp. 19-37) suggested the mediating role of CT interventions and assessments that could prepare 
students for learning STEM and non-STEM subjects. In light of the benefits of applying CT to other subject domains, we call for 
CT-embedded assessments that evaluate domain knowledge for researchers and educators to adapt in their interventions or 
classrooms. Further, compared to the assessments infusing CT with STEM subjects, the integration of CT into non-STEM assessments is 
seriously lacking. Assessment in this field is necessary, given that the integration of CT with non-STEM fields is growing (e.g., 
Jenkins’s, 2015, poetry teaching based on a CT approach). 

Fifth, interviews are currently under-utilized for measuring CT and deserve more attention. Given that CT functions as a set of 
complex mental operations, it is necessary to explicate the components embedded in this cognitive process (Ambrosio, Almeida, 
Macedo, & Franco, 2014). Researchers may conduct more interviews or think-alouds to collect in-depth, qualitative data as suggested 
by Werner, Denner, and Campe (2014). Especially with current tools, such as automatic transcription and analysis technology, 
interviews, including think-alouds, may play a more important role than before in studying students’ thinking. 

Finally, reliability and validity evidence of assessment tools should be reported when applicable. As reliability and validity in- 
formation serve as indicators of assessment quality, their results would help researchers to make revisions when improving current CT 
assessments and developing new CT instruments. Further, the report of acceptable reliability and validity indices would bolster the 
broader distribution of the CT instrument as researchers and educators would prefer to use a reliable and valid assessment tool. 

Even more fundamentally, the lack of consensus on the definition of CT has resulted in confusion about the constructs of CT assessment. 
Researchers (e.g., Hadad & Lawless, 2014; Haseski et al., 2018; Yadav, Good, Voogt, & Fisser, 2017) called for an agreement on the 
definition of CT and an alignment between CT application and subject domains. Without a theoretical framework to organize the 
consistency among various definitions of CT, it is difficult to reach a comprehensive understanding of what constitutes CT, how CT is 
different from other thinking processes, and how to assess students’ CT. Specifically, as noted in this review, the partial overlap of CT 
and programming skills lends itself to the conflation of CT and programming assessments, which is common in most of the current CT 
literature. Tracing back to the calls from Papert (1980) and Wing (2006; 2008), CT and programming concepts should be viewed 
as separate but compatible cognitive tools that respectively expand the ways of problem-solving. Given the current state of CT 
assessment, it is particularly critical for future research to investigate a theoretical framework of learning (e.g., Alexander’s Model for 
Domain Learning) and assessment (e.g., Evidence-Centered Design) to disentangle the unique learning development of CT skills and 
capture student CT performance. 

Another gap is that the current reliance on programming concepts as the constructs of CT assessment requires computing or 
programming proficiency of test designers and teachers. This may deter people who are unfamiliar with CS, as well as schools or 
after-school programs without sufficient access to computers and programming platforms, from applying CT skills in other subject 
domains. The lack of CT assessment in non-STEM fields may be largely due to this confusion between CT and programming skills. As 
suggested in the review of Kalelioglu et al. (2016), CT can be implemented in broader learning contexts rather than focusing on 
computational solutions. Further, Nishida et al. (2009) and Lockwood and Mooney (2018) suggested that the introduction of CT 
learning could be done without computers; and Sentance and Csizmadia (2017) found that teachers preferred to use unplugged, 
hands-on activities without computers to teach CT. Hennessey, Mueller, Beckett, and Fisher (2017) conducted a content analysis of the 
Ontario elementary school curriculum by examining the terms or phrases associated with CT processes. They found that most of the 
programming-related terms from Brennan and Resnick’s (2012) framework were rarely mentioned. Hence, more CT assessments free 
from computers or programming platforms should be developed in order to broaden the integration of CT skills into curriculum and 
programs that lack technology equipment or professional resources and to connect with students who do not see themselves as part of 
the dominant computing culture. The “unplugged” CT assessment, which corresponds to Grover and Pea’s (2018) call for teaching and 
learning CT skills without a computer or programming, can provide direction on CT skills in non-computing environments. As one 
example of such studies, Yin, Hadad, Tang, and Lin (2019) designed a CT achievement assessment that measures CT-integrated STEM 
learning and grounded test questions in the context of maker activities regarding physics and engineering without using computers or 
maker tools. 

Practically speaking, CT can be integrated into all subject domains across all educational levels. Different types of assessment tools 
can be used for different educational purposes. Specifically, (a) teachers may use a traditional test with selected- and/or constructed- 
response questions to evaluate students’ CT knowledge for a summative purpose, and researchers are able to evaluate the effect of their 
CT intervention by administering a traditional CT test in a pre-post design. (b) In order to glean an impression of students’ 
application of CT when working on hands-on projects, a portfolio-driven approach situates CT assessment in a real-world context and 
further allows teachers and researchers to provide formative feedback for students to improve their understanding and experience in 
CT. (c) Surveys play an important role in probing students’ perceptions and attitudes toward CT, which would also inform teachers and 
researchers of the potential improvement they could make to correspond to students’ self-efficacy and motivations in learning CT. (d) 
Researchers may use interviews as a qualitative approach to undertake case studies for a thorough investigation of a student’s problem- 
solving process based on CT and challenges they encounter, which could supplement the summative results of traditional assessments 
or the formative results of performance assessments. In addition, researchers could employ think-alouds to understand how students 
come up with the solution to each CT question and to examine whether they applied the CT skills the test claimed to measure. Such 
information would allow researchers to improve their CT assessment. (e) Finally, the combination of different assessment tools may be 
used as well to triangulate different CT assessments and provide a comprehensive evaluation of students’ CT learning. 


5. Limitations 


In the wake of the popularity of CT, researchers (e.g., Brown, 2017; Flórez et al., 2017) have indicated concerns over the 
representation of gender and minority groups in CT-related education and research. Consistent with these concerns, many studies reported 
demographic information when they conducted CT-related interventions. Among the 
reviewed studies, 67% reported gender information and 34% provided information on race and ethnicity. Most of them balanced the gender 
ratio, and many studies involved students of color, but few studies reported any information about the assessment features related to 
students’ demographic backgrounds. 

We decided not to focus on the demographic features of the participants in our review for the following reason. Although many 
studies conducting CT-related interventions reported demographic information, no study reported how they developed their 
assessment based on specific participants’ demographic features or how the assessment features were associated with student demographic 
information, except for their grade levels. This might be because CT assessments, like most assessments in education, are 
intentionally developed to serve a general population rather than a specific population to avoid stereotyping and bias (McMillan, 
2013). However, applying CT assessment in different demographic groups may contribute to extending the current literature. Future 
research may study demographic issues such as item bias, the cultural responsiveness of CT assessments, and differential item 
functioning in CT assessments. 


6. Conclusion and suggestions 


This literature review analyzes specific CT assessments and identifies current gaps and future directions to conceptualize and 
measure CT skills. 

The results showed that CT assessments were developed and applied unequally in educational contexts. They were more frequently 
used in elementary and middle schools than higher grades, and in formal educational settings more than informal ones. With a larger 
proportion of the studies related to programming or CS, more research needs to study how CT assessment can be applied to the broader 
content subjects. Studies measured a variety of CT related constructs, ranging from scientific knowledge, direct CT skills, computing 
skills, to attitudinal perceptions toward CT. The most commonly measured constructs were students’ direct CT skills and programming 
skills. Around 80 percent of studies assessed cognitive constructs while the rest measured affective outcomes. Four types of CT 
assessment emerged from literature: traditional assessment composed of selected- and/or constructed-response questions, portfolio 
assessment, survey, and interview. The majority of studies employed traditional paper-pencil tests and portfolio assessments to 
evaluate CT and CT-related skills. Finally, only a few of the studies reported the reliability and validity evidence of their assessments. 

CT is a fascinating and broad field. Our review intends to map the CT assessment territory that has been explored so far. The results 
show that substantial work has been done, but more is still needed. In particular, we suggest that researchers and CT assessment developers 
consider the following takeaways when designing CT assessments: (a) create more CT assessments for high school, college, professional 
development, and informal education settings; (b) focus on assessment constructs aligned with the corresponding CT definitions 
and with the subject-matter knowledge to promote the integration between CT and subject domains; (c) consider the concurrent use of 
qualitative measures collected by interviews, think-alouds, or focus groups to better understand students’ proficiency in CT; (d) report 
reliability and validity evidence to support confidence in the assessment; (e) ground definitions of CT in unique features that 
distinguish it from programming, computing, or other similar concepts; and (f) design common CT assessments that are applicable across 
platforms and devices in order to compare students’ CT performance under varied conditions of intervention. 

Finally, given that CT assessments have been developed in different fields for different grades, it would be helpful to build a 
searchable database where the assessment tools are systematically collected, categorized, and organized, so that both researchers and 
practitioners who need CT assessments can readily search for those tools and avoid reinventing the wheel. Instead, they can build 
on existing work, e.g., by empirically examining the reliability and validity of existing assessments and by 
developing new assessments to measure CT in new fields, at unexamined student levels, or in teacher professional development. 

CT can be applied in different subjects across different grade levels, which brings challenges as well as opportunities. Teachers and 
researchers across different disciplines and different educational levels should increase collaboration so that we can assess and 
promote CT systematically. In this way, assessments can be designed to map the learning progression of CT in each discipline and to 
encourage students to apply CT skills to learning in other disciplines. 

Appendix A. The Reviewed Computational Thinking Studies 


| Authors | School level | Subject domain | Educational setting | Assessment tools |
|---|---|---|---|---|
| Adler and Kim (2018) | Teachers | CS/Programming | Formal | Survey |
| Altanis, Retalis, and Petropoulou (2018) | Middle | Robotics/Game and CS/Programming | Formal | Portfolio |
| Angeli and Valanides (2019) | K-elementary | | Formal | Traditional, Portfolio |
| Asad, Tibi, and Raiyn (2016) | Elementary | CS/Programming | Formal | Survey |
| Atmatzidou and Demetriadis (2016) | High | Robotics/Game | Formal | Traditional, Survey, Interview |
| Bagley and Rabin (2016) | College | STEM | Formal | Interview |
| Basawapatna (2016) | Middle | STEM | Formal | Portfolio |
| Basogain, Olabe, Olabe, and Rico (2018) | Elementary, Middle, High | CS/Programming | Formal | Traditional, Portfolio |
| Basu et al. (2016) | Middle | STEM | Formal | Traditional, Portfolio |
| Basu et al. (2017a, 2017b) | Middle | STEM | Formal | Traditional, Portfolio |
| Benakli, Kostadinov, Satyanarayana, and Singh (2017) | College | STEM | Formal | Survey |
| Berland and Lee (2011) | College | Robotics/Game | Informal | Interview |
| Bers (2010) | K-elementary | Robotics/Game and STEM | Formal | Portfolio |
| Bers et al. (2014) | K-elementary | Robotics/Game and CS/Programming | Formal | Portfolio |
| Bower, Wood, Lai, Howe, Lister, Mason, Highfield, et al. (2017) | Teachers | non-STEM | Informal | Survey |
| Brady et al. (2017) | High | CS/Programming | Formal | Survey |
| Bucher (2017) | Teachers | non-STEM | Informal | Interview |
| Burleson et al. (2018) | Elementary | CS/Programming | Informal | Portfolio |
| Cetin (2016) | Teachers | CS/Programming | Formal | Traditional, Survey |
| Cetin and Andrews-Larson (2016) | Teachers | CS/Programming | Formal | Traditional, Survey, Interview |
| Chang and Peterson (2018) | Teachers | non-STEM | Formal | Interview |
| Chen et al. (2017a, 2017b) | Elementary | Robotics/Game | Formal | Traditional |
| Choi, Lee, and Lee (2017) | Elementary | CS/Programming | Formal | Traditional |
| Csernoch, Biró, Math, and Abari (2015) | College | STEM | Formal | Portfolio |
| Denner et al. (2012) | Middle | Robotics/Game and CS/Programming | Informal | Portfolio |
| Denner, Werner, Campe, and Ortiz (2014) | Middle | CS/Programming | Informal | Portfolio, Survey |
| Dolgopolovas et al. (2016) | College | STEM | Formal | Traditional |
| Durak and Saritepeci (2018) | Middle, High | | | Traditional, Survey |
| Falloon (2016) | Elementary | STEM | Formal | Portfolio |
| Flanigan et al. (2017a, 2017b) | College | CS/Programming | Formal | Traditional |
| Gadanidis, Clements, and Yiu (2018) | Elementary | STEM | Formal | Portfolio |
| Gandolfi (2018) | College | Robotics/Game | Informal | Survey |
| Garneli and Chorianopoulos (2018) | Middle | STEM and CS/Programming | Informal | Portfolio |
| Grover et al. (2015) | Middle | CS/Programming | Formal | Traditional, Portfolio, Survey, Interview |
| Günbatar and Bakirci (2019) | Teachers | non-STEM | Formal | Survey |
| Hershkovitz et al. (2019) | Elementary | CS/Programming and Robotics/Game | Informal | Portfolio |
| Hestness, Jass Ketelhut, McGinnis, and Plane (2018) | Teachers | CS/Programming | Informal | Portfolio, Interview |
| Hsiao et al. (2019) | Elementary | Robotics/Game and CS/Programming | Formal | Traditional |
| Izu, Mirolo, Settle, Mannila, and Stupuriene (2017) | K-elementary, Middle, High | | | Traditional |
| Jaipal-Jamani and Angeli (2017a, 2017b) | Teachers | Robotics/Game | Formal | Traditional, Portfolio, Survey |
| Jenkins (2015) | Middle | non-STEM | Formal | Traditional |
| Jenkins (2017) | Middle | non-STEM | Formal | Portfolio |
| Jenson and Droumeva (2016) | Elementary | Robotics/Game and CS/Programming | Informal | Traditional |
| Jun, Han, Kim, and Lee (2014) | Elementary | | Formal | Traditional |
| Kale, Akcaoglu, Cullen, and Goh (2018) | Teachers | | | Survey |
| Kong, Chiu, and Lai (2018) | Elementary | | Formal | Survey |
| Korkmaz and Bai (2019) | High | | | Survey |
| Korkmaz et al. (2017a, 2017b) | College | | Formal | Survey |
| Korucu, Gencturk, and Gundogdu (2017) | Elementary | | Formal | Survey |
| Kukul and Karatas (2019) | Middle, High | | | Survey |
| Lachney, Babbitt, Bennett, and Eglash (2019) | Middle | STEM and CS/Programming | Informal | Portfolio, Survey, Interview |
| Lai, Chen, Lai, Chang, and Su (2019) | College | CS/Programming | Formal | Portfolio, Survey |
| Lee et al. (2014) | Elementary, Middle | Robotics/Game | Informal | Interview |
| Leonard et al. (2016) | Middle | Robotics/Game | Formal | Portfolio |
| Leonard et al. (2018) | Teachers, K-12 students | Robotics/Game | Formal | Portfolio, Survey |
| Liu, Zhi, Hicks, and Barnes (2017) | Middle | Robotics/Game and CS/Programming | Informal | Survey |
| Lui et al. (2019a, 2019b) | High | STEM | Formal | Portfolio |
| Malyn-Smith and Lee (2012) | Middle, High, College | STEM | Informal | Portfolio |
| Marcelino, Pessoa, Vieira, Salvador, and Mendes (2018) | Teachers | CS/Programming | Formal | Portfolio, Interview |
| Mesiti et al. (2019) | Middle, High | | Informal | Portfolio, Survey, Interview |
| Moreno-León et al. (2017a, 2017b) | | | Informal | Portfolio |
| Mouza, Marzocchi, Pan, and Pollock (2016) | Middle | CS/Programming | Informal | Traditional, Portfolio, Survey |
| Mouza, Yang, Pan, Ozden, and Pollock (2017) | Teachers | non-STEM | Formal | Survey |
| Muñoz-Repiso and Caballero-González (2019) | K-elementary | Robotics/Game and CS/Programming | Formal | Portfolio |
| Pala and Mihci Türker (2019) | Teachers | CS/Programming | Formal | Survey |
| Peel, Sadler, and Friedrichsen (2019a, 2019b) | High | STEM | Formal | Portfolio, Interview |
| Pei et al. (2018) | High | STEM | Formal | Portfolio |
| Pellas and Peroutseas (2016) | High | CS/Programming | Informal | Portfolio, Interview |
| Peteranetz et al. (2017) | College | CS/Programming | Formal | Traditional |
| Pinkard, Martin, and Erete (2019) | Elementary, Middle | CS/Programming | Informal | Traditional, Portfolio, Interview |
| Pugnali et al. (2017) | K-elementary | Robotics/Game and CS/Programming | Informal | Portfolio |
| Rodriguez-Martinez, Gonzalez-Calero, and Sáez-López (2019) | Elementary | | Formal | Traditional |
| Román-González et al. (2017a, 2017b) | Elementary, Middle, High | | Formal | Traditional |
| Román-González, Pérez-González, Moreno-León, and Robles (2018) | Elementary, Middle, High | | Formal | Traditional, Survey |
| Rose, Habgood, and Jay (2017) | Elementary | | Formal | Portfolio |
| Sáez-López et al. (2016) | Elementary | non-STEM | Formal | Traditional, Portfolio, Survey |
| Sengupta et al. (2013) | Middle | STEM | Both | Traditional |
| Shell and Soh (2013a, 2013b) | College | CS/Programming | Formal | Traditional |
| Sherman and Martin (2015) | College | CS/Programming | Formal | Portfolio |
| Sung, Ahn, and Black (2017) | K-elementary | STEM and CS/Programming | Informal | Traditional, Portfolio |
| Taylor and Baek (2019) | Elementary | Robotics/Game and CS/Programming | Formal | Traditional, Portfolio, Survey |
| Thomas, Rankin, Minor, and Sun (2017) | Middle | Robotics/Game | Informal | Portfolio |
| Tran (2019) | Elementary | CS/Programming | Formal | Traditional, Survey, Interview |
| Tsai, Shen, Tsai, and Chen (2017) | College | CS/Programming | Formal | Traditional |
| Turchi, Fogli, and Malizia (2019) | High | | Informal | Portfolio |
| Weintrop et al. (2016) | Middle, High, College | Robotics/Game and STEM | Informal | Portfolio, Interview |
| Wilkerson-Jerde (2014) | Middle | STEM | Formal | Traditional, Portfolio |
| Wolz et al. (2011) | Middle | non-STEM | Both | Survey, Interview |
| Wong and Cheung (2018) | Elementary | CS/Programming | Formal | Interview |
| Wu (2018) | Middle | Robotics/Game | Informal | Portfolio, Survey |
| Wu, Hu, Ruis, and Wang (2019) | College | CS/Programming | Formal | Interview |
| Yadav et al. (2014) | Teachers | non-STEM | Formal | Traditional, Survey |
| Yagci (2019) | High | | | Survey |
| Yildiz, Yilmaz, and Yilmaz (2017) | Middle | Robotics/Game | Informal | Survey |
| Yuen and Robbins (2014) | College | CS/Programming | Formal | Interview |
| Zhong et al. (2016a, 2016b) | Elementary | CS/Programming | Formal | Portfolio |

Appendix B. Sample Questions for Each Category of Constructs 

I. First-order cognitive CT construct 


Beaver paddles in his canoe on a river. The river has a number of little lakes (Fig. 1). Beaver likes all lakes of the river and has 
thought of an algorithm to make sure that he reaches every lake. He knows that at each lake there is a maximum of two rivers that he 
has not yet seen. If beaver arrives at a lake he decides which river to take with the following rules: 


• If there are two rivers he has not yet seen, he takes the river on his left hand side. 
• If there is one river which beaver has not yet seen, beaver takes this river. 
• If beaver has seen all the rivers from a little lake, he paddles his canoe one lake back towards the previous lake. 


Fig. 1. Task “Beaver in his canoe”. 


Beaver stops his day of canoeing if he has seen everything and has come back to the start point. In Fig. 1 you can see the river and 
the little lakes where beaver paddles his canoe. 

In each little lake beaver sees a different animal. Beaver writes down the animal name when he sees an animal for the first time. 

In which order will beaver write down the animals? 

Answer (the correct answer is written in bold). 


a. Fish, frog, crocodile, turtle, stork, snake, otter, duck. 
b. Fish, crocodile, snake, stork, duck, otter, frog, turtle. 
c. Fish, frog, turtle, crocodile, stork, otter, duck, snake. 
d. Fish, frog, turtle. 


Source: Dolgopolovas et al. (2016). 
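
Beaver's three rules amount to a depth-first traversal that always explores the leftmost unseen river first and backtracks once a lake has no unseen rivers left. The Python sketch below illustrates this on a small hypothetical river map. Because Fig. 1 is not reproduced here, the lake layout and the placement of animals are invented for illustration and do not correspond to the answer options of the original task; the function name `canoe_trip` and the assumption that the rivers form a tree (no loops) are ours as well.

```python
# Hypothetical river map (NOT the map from Fig. 1): each lake, named after
# its animal, lists the lakes reached by its outgoing rivers, ordered from
# Beaver's left to right. Assumes the rivers form a tree (no loops).
RIVER_MAP = {
    "fish": ["frog", "crocodile"],
    "frog": ["turtle"],
    "turtle": [],
    "crocodile": ["stork", "snake"],
    "stork": [],
    "snake": [],
}


def canoe_trip(river_map, start):
    """Return the animals in the order Beaver first sees them."""
    seen = [start]   # animals written down so far
    path = [start]   # lakes between the start and Beaver's current lake
    next_river = {lake: 0 for lake in river_map}  # leftmost unseen river per lake
    while path:
        lake = path[-1]
        rivers = river_map[lake]
        if next_river[lake] < len(rivers):
            # Rules 1 and 2: take the leftmost river not yet seen.
            target = rivers[next_river[lake]]
            next_river[lake] += 1
            seen.append(target)
            path.append(target)
        else:
            # Rule 3: every river here has been seen, so paddle one lake back.
            path.pop()
    return seen


print(canoe_trip(RIVER_MAP, "fish"))
```

The explicit `path` list plays the role of Beaver's memory of the way back; the trip ends when he has backtracked past the start lake, which matches the stopping condition in the task.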


II. First-order non-cognitive CT construct 


1. Computational thinking is understanding how computers work. 

2. Computational thinking involves thinking logically to solve problems. 

3. Computational thinking involves using computers to solve problems. 

4. Computational thinking involves abstracting general principles and applying them to other solutions. 

5. I do not think it is possible to apply computing knowledge to solve other problems. 

6. I am not comfortable with learning computing concepts. 

7. I can achieve good grades (C or better) in computer courses. 

8. I can learn to understand computing concepts. 

9. I use computing skills in my daily life. 

10. I doubt that I have the skills to solve problems by using computer applications. 

11. I think computer science is boring. 

12. The challenge of solving problems using computer science appeals to me. 

13. I think computer science is interesting. 

14. I will voluntarily take computing courses if I were given the opportunity. 

15. Computational thinking can be incorporated in the classroom by using computers in the lesson plan. 

16. Computational thinking can be incorporated in the classroom by allowing students to problem solve. 

17. Knowledge of computing will allow me to improve my performance in my career. 

18. My career does not require that I learn computing skills. 

19. I expect that learning computing skills will help me to achieve my career goals. 

20. I hope that as my career continues it will require the use of computing concepts. 

21. Having background knowledge and understanding of computer science is valuable in and of itself. 


Note: Items are rated on a 4-point Likert-type scale: 1 = strongly disagree, 2 = disagree, 3 = agree, and 4 = strongly agree. 
Source: Leonard et al. (2018). 


III. Second-order programming/computing/CS construct 


This code below does not work. Can you figure out why? [Note: This program is executed on a stage which has red bricks] 

[Figure: Scratch script, not recoverable from the extracted text.] 

When the code above is executed, what is the value of ‘THE-Number’ at the end of the script for the following inputs after 
‘counter’ is set to 3- 


Source: Grover et al. (2015). 


IV. Second-order non-CS STEM construct 


Find at least one example of every possible triangle area in the given 4 × 4 lattice (there are 16 possible areas). This task asks students to 
explore the shapes and sizes of lattice triangles and discover (or re-discover) some unexpected results. 


[Figure with panels (a)–(d).] 


Source: Pei et al. (2018). 
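
The count of 16 possible areas can be checked by brute force. The following Python sketch is not part of the original task; it assumes that the "4 × 4 lattice" means a grid of 4 × 4 unit squares, i.e., 5 × 5 lattice points with integer coordinates 0 to 4, and the names `POINTS`, `doubled_area`, and `examples` are our own. Under that reading, the enumeration finds exactly 16 distinct non-zero areas (0.5 through 8 in steps of 0.5) and records one example triangle for each.

```python
from itertools import combinations

# Assumption: the "4 x 4 lattice" is a grid of 4 x 4 unit squares,
# i.e. 5 x 5 lattice points with integer coordinates 0..4.
SIDE = 4
POINTS = [(x, y) for x in range(SIDE + 1) for y in range(SIDE + 1)]


def doubled_area(p, q, r):
    """Twice the triangle's area via the shoelace formula (always an integer)."""
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1]))


# Keep the first example triangle found for each distinct non-zero area.
examples = {}
for triangle in combinations(POINTS, 3):
    d = doubled_area(*triangle)
    if d > 0:  # d == 0 means the three points are collinear
        examples.setdefault(d / 2, triangle)

for area in sorted(examples):
    print(area, examples[area])
print(len(examples), "distinct areas")
```

Working with the doubled area keeps the arithmetic in integers, so distinct areas are compared exactly rather than through floating-point values.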


V. Second-order non-STEM construct 

Exercise 3: Debugging 
Computational Thinking in English 

Debugging is when we look for problems when writing a computer program. The robot writing Haikus in the previous exercise has 
made a few mistakes and the poems are not displaying correctly. 

Look at the poems to the right and circle the errors in the programs to the left that have made each poem. One has been done for you. 


Program (worked example): 
Start Poem 
Repeat (Syllable, 5) 
New Line 
Repeat (Syllable, 7) 
New Line 
Repeat (Syllabel, 5) 
End Poem 

Poem: 
Up on the hill top 
Stood a tall tree and a rock 


Question 1 

Program: 
Start Poem 
Repeat (Syllable, 5) 
New Line 
Repeat (Syllable, 7) 
Repeat (Syllable, 5) 
End Poem 

Poem: 
Up on the hill top 
Stood a tall tree and a rock In the desert sun 


Program: 
Start Poem 
Syllable (Repeat, 5 
New Line 
Repeat (Syllable, 7) 
New Line 
Repeat (Syllabel, 5) 
End Poem 

Poem: 
p 
Stood a tall tree and a rock 
In the desert sun 
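
The robot programs in this worksheet can be read as a tiny instruction language. The Python sketch below is a hypothetical interpreter and not part of the original exercise: it assumes that `Repeat (Syllable, n)` emits the next n syllables of the haiku and that `New Line` ends the current line, it omits `Start Poem` and `End Poem`, and the syllable counts per word are our own. Under those assumptions, leaving out the second `New Line` reproduces the run-together poem shown for Question 1.

```python
# Words of the worksheet haiku with assumed syllable counts.
WORDS = [("Up", 1), ("on", 1), ("the", 1), ("hill", 1), ("top", 1),
         ("Stood", 1), ("a", 1), ("tall", 1), ("tree", 1), ("and", 1),
         ("a", 1), ("rock", 1),
         ("In", 1), ("the", 1), ("desert", 2), ("sun", 1)]


def run(program):
    """Run a list of instructions and return the poem as a list of lines."""
    lines, current, pos = [], [], 0
    for instruction in program:
        if instruction == "New Line":
            lines.append(" ".join(current))
            current = []
        else:
            # Instruction is ("Repeat", n): emit words until n syllables are used.
            _, count = instruction
            used = 0
            while used < count and pos < len(WORDS):
                word, syllables = WORDS[pos]
                current.append(word)
                used += syllables
                pos += 1
    if current:
        lines.append(" ".join(current))
    return lines


correct = [("Repeat", 5), "New Line", ("Repeat", 7), "New Line", ("Repeat", 5)]
buggy = [("Repeat", 5), "New Line", ("Repeat", 7), ("Repeat", 5)]  # second New Line missing

print(run(correct))
print(run(buggy))
```

Comparing the two outputs line by line mirrors what the worksheet asks students to do by hand: locate the instruction whose absence or misspelling explains the difference between the intended poem and the displayed one.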


VI. Second-order non-cognitive programming/CS/computing construct 


1) You created a mini science project using Scratch today. Now, how do you perceive programming as an educational tool? 

2) What challenges do you foresee in incorporating programming into your teaching? 

3) Do you think programming can improve students’ critical thinking or modify any misconceptions? 

4) Do you have any future project ideas using Scratch programming for your teaching? 

5) On a scale from 1 to 5, how likely are you to incorporate programming when teaching: 

6) Did you have any misconception about the Earth’s rotation around the sun before this activity? What about now? 


Source: Adler and Kim (2018a, 2018b). 



X. Tang et al. Computers & Education 148 (2020) 103798 


VII. Second-order non-cognitive non-CS construct 


1. Understood artistic elements in paintings. 
2. Learned biographical and historical contents of Spanish painters. 
3. Increased cultural and artistic competence to understand paintings. 
4. Improved the ability to understand artistic expressions from different eras. 
5. Analyzed historical and artistic content in paintings. 


Note: Items are rated on a 5-point Likert-type scale, with 1 = strongly disagree and 5 = strongly agree. 
Source: Sáez-López et al. (2016). 